A simple notebook (i.e. no error checking or sensible engineering) to extract the student answer data from a single XML file.
I'll also export the data to a CSV file at the end, so that it's easy to read back in at the beginning of another notebook.
Following discussions with Suraj, we want the representation to take into account the student's response, the official answer, and the grade. So there'll be a little fiddliness in linking the student response back to the gold-standard reference answer.
So, first read the file:
In [3]:
filename='semeval2013-task7/semeval2013-Task7-5way/beetle/train/Core/FaultFinding-BULB_C_VOLTAGE_EXPLAIN_WHY1.xml'
It's an XML file, so we'll need the xml.etree parser, and pandas so that we can import the data into a dataframe:
In [4]:
import pandas as pd
from xml.etree import ElementTree as ET
In [7]:
tree=ET.parse(filename)
r=tree.getroot()
Now, the reference answers are in the second daughter node of the tree. We can extract these and store them in a dictionary. To distinguish between reference answer tokens and student response tokens, I'm going to suffix each token in the reference answers with _RA, and each token in a student response with _SR.
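To make the convention concrete, here's a tiny illustrative example (the word list is made up, not taken from the data); the same surface token becomes two distinct features depending on whether it came from the student or from the reference answer:
# Purely illustrative: the suffix marks where a token came from.
example_tokens = ['the', 'bulb', 'is', 'broken']   # made-up tokens
print([t + '_SR' for t in example_tokens])   # ['the_SR', 'bulb_SR', 'is_SR', 'broken_SR']
print([t + '_RA' for t in example_tokens])   # ['the_RA', 'bulb_RA', 'is_RA', 'broken_RA']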
In [30]:
from string import punctuation
def to_tokens(textIn):
'''Convert the input textIn to a list of tokens'''
tokens_ls=[t.lower().strip(punctuation) for t in textIn.split()]
# remove any empty tokens
return [t for t in tokens_ls if t]
str='"Help!" yelped the banana, who was obviously scared out of his skin.'
print(str)
print(to_tokens(str))
In [50]:
refAnswers_dict={refAnswer.attrib['id']:[t+'_RA' for t in to_tokens(refAnswer.text)]
for refAnswer in r[1]}
refAnswers_dict
Out[50]:
Next, we need to extract each of the student responses. These are in the third daughter node:
In [41]:
print(r[2][0].text)
r[2][0].attrib
Out[41]:
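A quick aside: indexing by position (r[1] for the reference answers, r[2] for the student responses) relies on the order of the elements in the file. If in doubt, it's easy to list the child tags and sizes directly, e.g.:
# Sanity check: the tags of the root's children, and how many items each holds.
print([child.tag for child in r])
print(len(r[1]), len(r[2]))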
In [58]:
responses_ls=[]
for (i, studentResponse) in enumerate(r[2]):
if 'answerMatch' in studentResponse.attrib:
matchTokens_ls=refAnswers_dict[studentResponse.attrib['answerMatch']]
else:
matchTokens_ls=[]
responses_ls.append({'accuracy':studentResponse.attrib['accuracy'],
'text':studentResponse.text,
'tokens':[t+'_SR' for t in to_tokens(studentResponse.text)] + matchTokens_ls})
responses_ls[36]
Out[58]:
OK, that seems to work. Now, let's define a function that takes a filename and returns the list of token dictionaries:
In [66]:
def extract_token_dictionaries(filenameIn):
# Localise the to_tokens function
def to_tokens_local(textIn):
'''Convert the input textIn to a list of tokens'''
tokens_ls=[t.lower().strip(punctuation) for t in textIn.split()]
# remove any empty tokens
return [t for t in tokens_ls if t]
tree=ET.parse(filenameIn)
root=tree.getroot()
refAnswers_dict={refAnswer.attrib['id']:[t+'_RA' for t in to_tokens_local(refAnswer.text)]
for refAnswer in root[1]}
responsesOut_ls=[]
for (i, studentResponse) in enumerate(root[2]):
if 'answerMatch' in studentResponse.attrib:
matchTokens_ls=refAnswers_dict[studentResponse.attrib['answerMatch']]
else:
matchTokens_ls=[]
responsesOut_ls.append({'accuracy':studentResponse.attrib['accuracy'],
'text':studentResponse.text,
'tokens':[t+'_SR' for t in to_tokens_local(studentResponse.text)] \
+ matchTokens_ls})
return responsesOut_ls
We now have a function which takes a filename and returns a list of tokenised student responses and reference answers:
In [68]:
extract_token_dictionaries(filename)[:2]
Out[68]:
So next we need to be able to build a document frequency dictionary from a list of tokenised documents.
In [73]:
def document_frequencies(listOfTokenLists):
# Build the dictionary of all tokens used:
token_set=set()
for tokenList in listOfTokenLists:
token_set=token_set.union(set(tokenList))
# Then return the document frequency counts for each token
return {t:len([l for l in listOfTokenLists if t in l])
for t in token_set}
In [81]:
tokenLists_ls=[x['tokens'] for x in extract_token_dictionaries(filename)]
document_frequencies(tokenLists_ls)
Out[81]:
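As an aside, the same counts can be built in a single pass over the documents rather than rescanning the whole list for every token; here's a sketch using collections.Counter (same output, just a different route, and not used below):
from collections import Counter

def document_frequencies_alt(listOfTokenLists):
    '''Single-pass version of document_frequencies (for comparison only).'''
    df_counter = Counter()
    for tokenList in listOfTokenLists:
        # set() so each token is counted at most once per document
        df_counter.update(set(tokenList))
    return dict(df_counter)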
Next, define a function which takes a list of tokens and a document frequency dictionary, and returns a dictionary of tf.idf values (here computed as term frequency divided by document frequency) for each of the tokens in the list. Note: if a token isn't in the document frequency dictionary, it won't appear in the returned tf.idf dictionary.
We can use the collections.Counter object to get the tf values.
In [82]:
from collections import Counter
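As a quick illustration (the string here is just made up), Counter gives us the raw term frequencies directly:
Counter('the cat sat on the mat'.split())
# -> Counter({'the': 2, 'cat': 1, 'sat': 1, 'on': 1, 'mat': 1})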
In [86]:
def get_tfidf(tokens_ls, docFreq_dict):
tf_dict=Counter(tokens_ls)
return {t:tf_dict[t]/docFreq_dict[t] for t in tf_dict if t in docFreq_dict}
In [88]:
get_tfidf('the cat sat on the mat'.split(), {'cat':2, 'the':1})
Out[88]:
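One thing worth flagging: the value above is the term frequency divided by the raw document frequency, and that's the weighting used in the rest of this notebook. The more conventional tf.idf uses a log-scaled inverse document frequency; if we ever wanted that instead, a sketch would look like this (num_docs being the total number of documents):
import math

def get_tfidf_log(tokens_ls, docFreq_dict, num_docs):
    '''Conventional tf * log(N/df) variant -- for comparison only, not used below.'''
    tf_dict = Counter(tokens_ls)
    return {t: tf_dict[t] * math.log(num_docs / docFreq_dict[t])
            for t in tf_dict if t in docFreq_dict}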
Finally, we want to convert the outputs for all of the responses into a dataframe.
In [105]:
# Extract the data from the file:
tokenDictionaries_ls=extract_token_dictionaries(filename)
# Build the lists of responses:
tokenLists_ls=[x['tokens'] for x in extract_token_dictionaries(filename)]
# Build the document frequency dict
docFreq_dict=document_frequencies(tokenLists_ls)
# Create the tf.idf for each response:
tfidf_ls=[get_tfidf(tokens_ls, docFreq_dict) for tokens_ls in tokenLists_ls]
# Now, create a dataframe which is indexed by the token dictionary:
trainingText_df=pd.DataFrame(index=docFreq_dict.keys())
# Use the index of responses in the list as column headers:
for (i, tfidf_dict) in enumerate(tfidf_ls):
    trainingText_df[i]=pd.Series(tfidf_dict, index=trainingText_df.index)
# Finally, transpose, and replace the NaNs with 0:
trainingText_df.fillna(0).T
Out[105]:
Cool, that seems to work. Now we just need to do it for the complete set of files. We'll just use beetle/train/Core for the time being.
In [107]:
!ls semeval2013-task7/semeval2013-Task7-5way/beetle/train/Core/
Use os.walk to get the files:
In [114]:
import os
We can now do the same as before, but this time using all the files to construct the final dataframe. We also need a series containing the accuracy measures.
In [137]:
tokenDictionaries_ls=[]
# glob would have been easier... (see the glob sketch after this cell)
for (root, dirs, files) in os.walk('semeval2013-task7/semeval2013-Task7-5way/beetle/train/Core/'):
for filename in files:
if filename[-4:]=='.xml':
tokenDictionaries_ls.extend(extract_token_dictionaries(os.path.join(root, filename)))
# Now we've extracted the information from all the files. We can now construct the dataframe
# in the same way as before:
# Build the lists of responses:
tokenLists_ls=[x['tokens'] for x in tokenDictionaries_ls]
# Build the document frequency dict
docFreq_dict=document_frequencies(tokenLists_ls)
# Now, create a dataframe which is indexed by the tokens
# in the token frequency dictionary:
trainingText_df=pd.DataFrame(index=docFreq_dict.keys())
# Populate the dataframe with the tf.idf for each response. Also,
# create a dictionary of the accuracy values while we're at it.
accuracy_dict={}
for (i, response_dict) in enumerate(tokenDictionaries_ls):
trainingText_df[i]=pd.Series(get_tfidf(response_dict['tokens'], docFreq_dict),
index=trainingText_df.index)
accuracy_dict[i]=response_dict['accuracy']
# Finally, transpose, and replace the NaNs with 0:
trainingText_df=trainingText_df.fillna(0).T
# Also, to make it easier to store in a single csv file, let's put the accuracy
# values in a column (this won't clash with any occurrences of the token "accuracy"
# because we've changed the tokens to "accuracy_SR" and "accuracy_RA"):
trainingText_df['accuracy']=pd.Series(accuracy_dict)
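As the comment in the cell above says, glob would have been easier; for the record, the file-collection step could be written as something like the sketch below, which should pick up the same .xml files.
# A glob version of the file listing above (just a sketch, not run here);
# recursive=True so any subdirectories are also searched, like os.walk.
from glob import glob

xml_paths = glob('semeval2013-task7/semeval2013-Task7-5way/beetle/train/Core/**/*.xml',
                 recursive=True)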
In [138]:
trainingText_df.head()
Out[138]:
And finish by exporting to a CSV file:
In [141]:
trainingText_df.to_csv('beetleTrainingData.csv', index=False)
Done! We can now import the data into a dataframe with:
pd.read_csv('beetleTrainingData.csv')
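In the follow-on notebook, the accuracy column can then be peeled off as the label, leaving the tf.idf columns as the feature matrix; something along these lines (a sketch, assuming the column layout written above):
import pandas as pd

beetle_df = pd.read_csv('beetleTrainingData.csv')
y = beetle_df['accuracy']                  # the grade labels
X = beetle_df.drop(columns=['accuracy'])   # the tf.idf token features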